I will use the dataset on bridge condition.
library(dplyr)  # for filter() and mutate()

bridges <- read.csv("https://sldr.netlify.app/data/bridges.csv") |>
  # remove closed bridges (Condition.Rating = 0)
  filter(Condition.Rating >= 1) |>
  # remove structurally deficient and functionally obsolete bridges
  filter(SD.FO.Status == "N") |>
  # remove bridges supposedly built after data collection
  filter(Year.Built <= 2015) |>
  mutate(Year.Built.Z = as.numeric(scale(Year.Built)),
         Inspection.Weeks = as.numeric(
           difftime(lubridate::ymd('2015-01-01'),
                    lubridate::mdy(Date.Last.Inspected),
                    units = 'weeks')),
         Inspection.Weeks.Z = as.numeric(scale(Inspection.Weeks)))
I will model Condition.Rating as a function of Year.Built.Z and Inspection.Weeks.Z. My thinking: older bridges will have deteriorated more, and among bridges of a given age, those inspected more recently may be in better shape. Condition will also vary by location and owner (and bridge age may differ depending on location and owner too). There are bridges of different types and sizes; different kinds of bridges may be built more in certain eras than others, and bridge type may vary by location too. Bridge type may affect condition and may also interact with age (with the amount of deterioration per year differing by bridge type). Note that the graph below can't depict the interaction, but it does make Type a modifier of the age-condition relationship.
So: modeling condition as a function of just age and weeks-since-inspection is clearly an oversimplification, and leaves out key confounders and a moderator! (Wish we already knew how to include more variables in our model…)
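The causal reasoning above can be sketched as a diagram in code. This is just one plausible DAG reflecting my assumptions (the node names here are informal labels, not columns in the dataset), drawn with the dagitty package; note a DAG cannot show the Type-by-Age interaction:

```r
library(dagitty)

# hypothetical DAG for the relationships described above
bridge_dag <- dagitty('dag {
  Age -> Condition
  Inspection -> Condition
  Location -> Condition ; Location -> Age ; Location -> Type
  Owner -> Condition ; Owner -> Age
  Type -> Condition ; Type -> Age
}')
plot(graphLayout(bridge_dag))
```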
\[\text{Condition}_i \sim \text{Normal}(\mu_i, \sigma)\]
\[\mu_i = \beta_0 + \beta_1 \cdot \text{Year.Built.Z}_i + \beta_2 \cdot \text{Inspection.Weeks.Z}_i\]
\[\beta_0 \sim \text{Normal}(6, 1)\]
\[\beta_1 \sim \text{Normal}(1, 1)\]
\[\beta_2 \sim \text{Normal}(0, 0.5)\]
\[\sigma \sim \text{Normal}(1.5, 2)\]
Rationale:
gf_point(Condition.Rating ~ Year.Built.Z, data = bridges)
gf_point(Condition.Rating ~ Inspection.Weeks.Z, data = bridges)
Having peeked at the data, I have some more doubts. The trend with year built does not look linear. So far we have several reasons to expect the model we’re planning won’t match the data well, and would probably go back to the planning stage if this were not a teaching example.
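One quick way to check that doubt is to overlay a flexible smoother on the scatterplot (a sketch; gf_smooth() wraps ggplot2's geom_smooth, which picks a GAM automatically for large datasets):

```r
# overlay a smoother to eyeball (non)linearity in the year-built trend
gf_point(Condition.Rating ~ Year.Built.Z, data = bridges, alpha = 0.1) |>
  gf_smooth(se = FALSE, color = 'red')
```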
I want to generate a set of simulated Condition.Rating
values that are consistent with my prior. Here is one such
dataset the same size as the real data:
# draw a simulated intercept, slopes, and sigma from the priors
b0 <- rnorm(1, mean = 6, sd = 1)
b1 <- rnorm(1, mean = 1, sd = 1)
b2 <- rnorm(1, mean = 0, sd = 0.5)
# sigma must be positive, so fold any negative draws back above zero
sigma <- abs(rnorm(1, mean = 1.5, sd = 2))

prior_sim <- tibble(
  # use the Year and inspection data from the actual dataset
  Year.Built.Z = bridges$Year.Built.Z,
  Inspection.Weeks.Z = bridges$Inspection.Weeks.Z,
  Sim.Condition = rnorm(nrow(bridges),
                        mean = b0 + b1 * Year.Built.Z + b2 * Inspection.Weeks.Z,
                        sd = sigma)
)
gf_point(Sim.Condition ~ Year.Built.Z, data = prior_sim)
gf_point(Sim.Condition ~ Inspection.Weeks.Z, data = prior_sim)
If we run this a bunch of times to get a sense of the different possibilities implied by my priors, we see that condition ratings within the realistic limits (between 2 and 7) are definitely possible under these priors, and the slopes in the simulated data are often within the range of plausibility. (Note that when I was working on the non-standardized scale for the covariates, I did not do such a good job with the priors!)
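The "run this a bunch of times" step can be wrapped in a small function. Here is one sketch (the helper name is mine, not from any package) that draws many intercept/slope pairs from the priors and plots all the implied lines at once:

```r
# hypothetical helper: one draw from the priors -> one (intercept, slope) pair
draw_prior_line <- function(id) {
  tibble(sim = id,
         b0 = rnorm(1, mean = 6, sd = 1),
         b1 = rnorm(1, mean = 1, sd = 1))
}
prior_lines <- purrr::map_dfr(1:50, draw_prior_line)

# overlay all 50 prior-implied lines on a blank canvas
gf_blank(Condition.Rating ~ Year.Built.Z, data = bridges) |>
  gf_abline(slope = ~b1, intercept = ~b0, data = prior_lines, alpha = 0.3)
```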
We still might want to use other distributions in a way that would enforce the real limits on the support of the condition ratings (they are all between 1-7). We also might want to use even less diffuse (more informative) priors, if we can manage to define them.
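For example, one crude way to respect the 1-7 support (a sketch only, reusing the prior draws from above and assuming a positive sigma; a more principled choice later would be an ordered-categorical likelihood) is to clamp the simulated ratings:

```r
# simulate ratings as before, then clamp into the possible 1-7 range
sim_raw <- rnorm(nrow(bridges),
                 mean = b0 + b1 * bridges$Year.Built.Z +
                        b2 * bridges$Inspection.Weeks.Z,
                 sd = sigma)
sim_clamped <- pmin(pmax(sim_raw, 1), 7)
range(sim_clamped)  # now guaranteed within [1, 7]
```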
But for now let’s just continue…
library(rethinking)  # for quap() and extract.samples()

bridge_mod <- quap(
  flist = alist(
    Condition.Rating ~ dnorm(mu, sigma),
    mu <- b0 + b1 * Year.Built.Z + b2 * Inspection.Weeks.Z,
    b0 ~ dnorm(6, 1),
    b1 ~ dnorm(1, 1),
    b2 ~ dnorm(0, 0.5),  # match the prior stated above
    sigma ~ dnorm(1.5, 2)
  ),
  data = bridges)
psamp <- extract.samples(bridge_mod, n = 10000)
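Before plotting the marginal densities, a compact numeric summary is also handy; rethinking's precis() gives point estimates and compatibility intervals for all parameters, and HPDI() summarizes a single column of posterior samples:

```r
# numeric summary of the fitted model: estimates plus 95% intervals
precis(bridge_mod, prob = 0.95)

# or summarize the posterior samples directly, e.g. for the age slope:
HPDI(psamp$b1, prob = 0.95)
```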
gf_dens(~b0, data = psamp, title = 'Intercept\n(Condition.Rating of Bridge with average Year Built and Weeks Since Inspection)')
gf_dens(~b1, data = psamp, title = 'Slope\n(Increase in Condition.Rating for a 1-SD Increase in Year.Built.Z)')
gf_dens(~b2, data = psamp, title = 'Slope\n(Increase in Condition.Rating for a 1-SD Increase in Inspection.Weeks.Z)')
gf_dens(~sigma, data = psamp, title = 'Sigma\n(std dev of Condition for a specific Year Built and Weeks Since Inspection)')
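Since Year.Built.Z is standardized, b1 is the change in expected rating per standard deviation of Year.Built. To interpret it on the original per-calendar-year scale, divide by that standard deviation (a quick conversion sketch):

```r
# convert the standardized slope back to the original (per-year) scale
sd_year <- sd(bridges$Year.Built)
b1_per_year <- psamp$b1 / sd_year
mean(b1_per_year)  # posterior mean change in rating per 1-year-newer bridge
```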
What’s this model doing? It’s quantifying the linear relationships between bridge condition and the predictor and covariate: year built and weeks since last inspection.
Can we show all this in one graph? Honestly, not easily. And here we only have our main predictor plus one covariate. If we had many, it would be even harder.
We might consider drawing the lines corresponding to our posterior sample to visualize what the model says about the relationship of age and condition, or age and inspection time. Notice I’m not overlaying this on the real data anymore, because in the real data the condition is (we think) being affected by both age and inspection, and inspection is not accounted for in this plot.
This graph shows the effect of Year.Built.Z on condition for bridges at a fixed (average) time-since-inspection. In other words, it shows the effect of age on condition, having accounted for the effect of Inspection.Weeks.Z.
# first make a smaller posterior sample so R won't crash
psamp_sm <- extract.samples(bridge_mod, n = 1000)
# gf_abline needs an existing plot to add lines to, so start with a blank canvas
gf_blank(Condition.Rating ~ Year.Built.Z, data = bridges) |>
gf_abline( # draws "line"s given "a" (slope) and "b" (intercept)
slope = ~b1, # draws one line for each row of psamp
intercept = ~b0,
data = psamp_sm,
alpha = 0.05, # closer to 0 = more transparent
linewidth = 0.1 # smaller = thinner
)
gf_blank(Condition.Rating ~ Inspection.Weeks.Z, data = bridges) |>
gf_abline( # draws "line"s given "a" (slope) and "b" (intercept)
slope = ~b2, # draws one line for each row of psamp
intercept = ~b0,
data = psamp_sm,
alpha = 0.05, # closer to 0 = more transparent
linewidth = 0.1 # smaller = thinner
)
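As an alternative to these spaghetti-line plots, rethinking's link() can compute the posterior for mu over a grid of predictor values, which we can then summarize as a mean line with an interval band (a sketch, holding inspection time at its mean of 0):

```r
# posterior for mu over a grid of ages, with Inspection.Weeks.Z fixed at 0
age_grid <- data.frame(Year.Built.Z = seq(-3, 2, length.out = 50),
                       Inspection.Weeks.Z = 0)
mu <- link(bridge_mod, data = age_grid)    # matrix: samples x grid points
mu_PI <- apply(mu, 2, PI, prob = 0.95)     # 2 x 50 matrix of interval bounds

band <- tibble(Year.Built.Z = age_grid$Year.Built.Z,
               mu_mean = apply(mu, 2, mean),
               lo = mu_PI[1, ],
               hi = mu_PI[2, ])

# shaded 95% interval for mu, with the posterior mean line on top
gf_ribbon(lo + hi ~ Year.Built.Z, data = band, alpha = 0.2) |>
  gf_line(mu_mean ~ Year.Built.Z, data = band)
```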
A posterior predictive distribution would be made up of many fake
datasets, each one including nrow(bridges) simulated
Condition.Rating values. And to simulate new datapoints,
we’d take into account the posterior for \(\sigma\) as well as those of the
slope and intercept.
You can do this as before, but you can’t display the results in a single graph, and interpretation is more complicated just as it was for the “line” plots above.
library(tidyr)  # for unnest()

psamp_sm <- psamp_sm |>
  # add row numbers to identify each draw from the posterior
  mutate(sim_id = row_number()) |>
  rowwise() |> # one simulation per draw from the posteriors
  mutate(
    # mean_condition is the expected response:
    # a list-column with one value per row of the original data, `bridges`
    mean_condition = list(b0 +
                            b1 * bridges$Year.Built.Z +
                            b2 * bridges$Inspection.Weeks.Z),
    # store predictor data too, to make graphs easier later
    year_built = list(bridges$Year.Built.Z),
    inspection_weeks = list(bridges$Inspection.Weeks.Z)) |>
  mutate(
    # use sigma to simulate how far datapoints fall from the expected response
    sim_condition = list(mean_condition +
                           rnorm(n = nrow(bridges), mean = 0, sd = sigma))
  ) |>
  unnest(cols = c(year_built, inspection_weeks,
                  mean_condition, sim_condition)) |>
  ungroup()
We now have many simulated datasets generated based on our fitted model. The simulated datasets look how we’d expect real data to look, if the model were true.
How can we visualize many datasets at once? We can look at just the first 10 datasets to get an idea of what we have done:
gf_point(sim_condition ~ year_built | sim_id,
data = psamp_sm |> filter(sim_id <= 10),
alpha = 0.1) |>
# dashed lines show ranges observed in real data
# (not needed if there are not lower/upper bounds on possible values of response)
gf_hline(yintercept = 7, linetype = 'dashed') |>
gf_hline(yintercept = 2, linetype = 'dashed')
How does this compare to the actual data?
gf_point(Condition.Rating ~ Year.Built.Z, data = bridges)
The simulated data are more patternless (aside from the linear trend), and go above the top condition rating of 7. The real data have many more very old bridges in good condition (the simulated data have none); otherwise, the simulated data do not look terribly different from the real data.
How else could we show “all” the posterior predictive distribution?
One idea is to make a scatter plot where color intensity corresponds to the number of datapoints in each location. We could then overlay the actual data on top.
Note: this takes a loooong time when the dataset is as large as this one.
# we need an even smaller dataset to make this plot without crashing R. let's try using just the first 200 rows of psamp_sm
psamp_vsm <- slice_head(psamp_sm, n = 200)
gf_density2d_filled(sim_condition ~ year_built,
                    data = psamp_vsm) |>
  gf_theme(scale_fill_grey('')) |> # use greyscale color palette
  gf_refine(guides(fill = "none")) |> # don't show color legend
  # (optional) add the real data
  gf_point(Condition.Rating ~ Year.Built.Z, data = bridges,
           color = 'goldenrod',
           size = 1, alpha = 0.1) # VERY nearly transparent
This is still kind of useful, because we can see where our model predicts there will be data (and there isn’t) and where our model thinks there will NOT be data (and there is).
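We can also quantify that mismatch numerically instead of just eyeballing it; for example, compare the share of impossible ratings (above 7) in the simulations with the real data (a rough sketch):

```r
# proportion of simulated ratings above the maximum possible rating of 7
mean(psamp_sm$sim_condition > 7)

# the real ratings can never exceed 7, so this is 0
mean(bridges$Condition.Rating > 7)
```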
We could make a similar plot for the inspection covariate too:
gf_density2d_filled(sim_condition ~ inspection_weeks,
                    data = psamp_vsm) |>
  gf_theme(scale_fill_grey('')) |> # use greyscale color palette
  gf_refine(guides(fill = "none")) |> # don't show color legend
  # (optional) add the real data
  gf_point(Condition.Rating ~ Inspection.Weeks.Z, data = bridges,
           color = 'goldenrod',
           size = 1, alpha = 0.1) # VERY nearly transparent